Tidy Models - Regression

Predicting numerical outcomes using partial least squares on gasoline dataset

lruolin
03-09-2021

Overview

The partial least square (PLS) regression method may be used to predict one or more numerical Y outcome variable when the X variables are highly correlated. In the chemistry field, it is very useful for relating spectra to chemical properties.

The problem with using ordinary least square linear regression for such datasets is that when X variables are highly correlated, it is difficult to interpret coefficients of different X meaningfully, since they are all correlated.

PLS technique is also useful when the number of samples is lesser than the number of variables. It maximises the covariance of both X and Y to calculate scores (how much a particular object contributes to a latent variable) and loadings (variable coefficients used to define a component). For principal component regression (PCR), only X variables are taken into account. If the variability of X is not related to variability of Y, then PCR will have difficulty identifying a predictive relationship, when one might exist. It helps to uncover latent structures in the highly correlated X variables, so as to predict Y.

In a way, PLS is a supervised dimension reduction. It finds components that maximally summarise the variation of the predictors while simultaneiously requiring these components to have maximum correlation with the response.

The workflow below is for predicting one Y variable, using multiple X variables, on the gasoline NIR dataset. The outcome variable of interest is the octane number.

General workflow:

  1. Exploratory data analysis: Check if there are missing values, correlation of variables, identify the pre-processing steps required. How many variables are there? How are they distributed? Is it suitable for PLS?

The data may also be visualized before modelling.

  1. Split the dataset into training and testing dataset. The training dataset is used to build and tune the models, and the testing dataset is used to evaluate the model. Define a 10-fold cross validation dataset, from the training dataset. Resampling is useful if the sample size is small, to have a better bias-variance tradeoff (minimise over-fitting of model) for better predictive ability.

  2. Preprocess the training dataset. Data would have to be imputed if there are missing values, and also normalised since PLS is using variance to understand dissimilarity in the X variables.

  3. Train the model using the training dataset. In this case, the pls model is used. For PLS model, there is one tuning parameter, which is the number of components. The tune package will be used for tuning, and the data will be the 10-fold cross validation dataset.

  4. Assess the training model using root mean square error (RMSE) and mean absolute error (MAE), as well as r-sq for accuracy.

  5. Determine which variables are important in the model

  6. Predict new data

Packages required

library(pacman)
p_load(modeldata, pls, tidyverse, tidymodels, corrplot, 
       skimr, plsmod, ggthemes)

Import data

This dataset is from the pls package, and has octane number (Y outcome variable), as well as NIR spectra of 60 gasoline samples.

data(gasoline)

Visualize

data_plot <- cbind(gasoline, as.data.frame((unclass(gasoline$NIR)))) %>% 
  dplyr::select(-NIR) %>% 
  as_tibble() %>% 
  rowid_to_column() %>% 
  janitor::clean_names() %>% 
  pivot_longer(cols = starts_with("x"),
               names_to = "wavelength",
               values_to = "reflectance") %>% 
  mutate(wavelength_number = parse_number(wavelength))

data_plot %>% 
  ggplot(aes(x = wavelength_number, y = reflectance, col = factor(rowid))) +
  geom_line(show.legend = F) +
  labs(y = "log(1/R)",
       x = "nm",
       title = "Plot of NIR spectra for 60 gasoline samples") +
  coord_cartesian(xlim = c(900, 1800)) +
  theme_classic()

Modelling

To use the dataset for modelling using the tidymodels framework, convert the data into a tidy tibble structure. The pls package requires that the data be provided as a matrix format, but for the tidymodels framework, the data should be in a tibble format for ease in creating plots and modelling.

gasoline_tidy <-  cbind(gasoline, as.data.frame((unclass(gasoline$NIR)))) %>% 
  dplyr::select(-NIR) %>% 
  as_tibble() %>% 
  janitor::clean_names() 

glimpse(gasoline_tidy) # 60 rows, 402 columns
Rows: 60
Columns: 402
$ octane   <dbl> 85.30, 85.25, 88.45, 83.40, 87.90, 85.50, 88.90, 88…
$ x900_nm  <dbl> -0.050193, -0.044227, -0.046867, -0.046705, -0.0508…
$ x902_nm  <dbl> -0.045903, -0.039602, -0.041260, -0.042240, -0.0451…
$ x904_nm  <dbl> -0.042187, -0.035673, -0.036979, -0.038561, -0.0410…
$ x906_nm  <dbl> -0.037177, -0.030911, -0.031458, -0.034513, -0.0363…
$ x908_nm  <dbl> -0.033348, -0.026675, -0.026520, -0.030206, -0.0327…
$ x910_nm  <dbl> -0.031207, -0.023871, -0.023346, -0.027680, -0.0314…
$ x912_nm  <dbl> -0.030036, -0.022571, -0.021392, -0.026042, -0.0314…
$ x914_nm  <dbl> -0.031298, -0.025410, -0.024993, -0.028280, -0.0346…
$ x916_nm  <dbl> -0.034217, -0.028960, -0.029309, -0.030920, -0.0377…
$ x918_nm  <dbl> -0.036012, -0.032740, -0.033920, -0.034012, -0.0407…
$ x920_nm  <dbl> -0.039792, -0.036683, -0.038539, -0.037082, -0.0440…
$ x922_nm  <dbl> -0.043037, -0.040169, -0.042571, -0.040444, -0.0473…
$ x924_nm  <dbl> -0.047313, -0.044899, -0.047511, -0.044858, -0.0514…
$ x926_nm  <dbl> -0.048103, -0.046266, -0.048487, -0.046544, -0.0520…
$ x928_nm  <dbl> -0.050627, -0.048627, -0.050455, -0.048978, -0.0540…
$ x930_nm  <dbl> -0.053830, -0.052014, -0.053913, -0.052483, -0.0571…
$ x932_nm  <dbl> -0.054604, -0.053635, -0.055195, -0.054078, -0.0583…
$ x934_nm  <dbl> -0.056676, -0.055454, -0.056713, -0.055723, -0.0599…
$ x936_nm  <dbl> -0.058428, -0.056777, -0.057447, -0.056999, -0.0606…
$ x938_nm  <dbl> -0.060644, -0.059331, -0.059620, -0.059409, -0.0628…
$ x940_nm  <dbl> -0.061712, -0.060518, -0.060656, -0.060543, -0.0637…
$ x942_nm  <dbl> -0.063148, -0.062084, -0.061767, -0.061868, -0.0650…
$ x944_nm  <dbl> -0.064480, -0.063331, -0.062859, -0.062899, -0.0657…
$ x946_nm  <dbl> -0.065987, -0.065059, -0.064668, -0.064401, -0.0672…
$ x948_nm  <dbl> -0.066894, -0.065536, -0.065566, -0.064814, -0.0679…
$ x950_nm  <dbl> -0.068201, -0.066877, -0.066890, -0.065989, -0.0692…
$ x952_nm  <dbl> -0.069434, -0.067899, -0.068397, -0.067180, -0.0707…
$ x954_nm  <dbl> -0.069594, -0.067662, -0.068439, -0.066987, -0.0707…
$ x956_nm  <dbl> -0.071336, -0.069050, -0.070171, -0.068829, -0.0724…
$ x958_nm  <dbl> -0.071208, -0.068497, -0.070077, -0.069063, -0.0722…
$ x960_nm  <dbl> -0.071454, -0.068434, -0.070639, -0.069995, -0.0726…
$ x962_nm  <dbl> -0.071469, -0.067953, -0.070578, -0.070687, -0.0726…
$ x964_nm  <dbl> -0.071631, -0.067061, -0.070137, -0.070725, -0.0722…
$ x966_nm  <dbl> -0.070886, -0.066073, -0.069803, -0.070661, -0.0718…
$ x968_nm  <dbl> -0.071010, -0.065564, -0.069738, -0.071043, -0.0718…
$ x970_nm  <dbl> -0.070609, -0.065335, -0.070031, -0.071324, -0.0718…
$ x972_nm  <dbl> -0.070464, -0.064051, -0.069496, -0.070894, -0.0714…
$ x974_nm  <dbl> -0.070433, -0.063666, -0.069789, -0.071157, -0.0717…
$ x976_nm  <dbl> -0.069957, -0.062513, -0.069002, -0.070798, -0.0711…
$ x978_nm  <dbl> -0.070208, -0.061955, -0.069288, -0.071120, -0.0713…
$ x980_nm  <dbl> -0.070523, -0.060906, -0.069398, -0.071266, -0.0715…
$ x982_nm  <dbl> -0.070252, -0.059413, -0.068613, -0.070540, -0.0707…
$ x984_nm  <dbl> -0.069360, -0.058242, -0.067675, -0.069737, -0.0698…
$ x986_nm  <dbl> -0.068338, -0.056855, -0.066685, -0.068637, -0.0687…
$ x988_nm  <dbl> -0.066991, -0.055982, -0.065196, -0.067211, -0.0673…
$ x990_nm  <dbl> -0.065362, -0.055107, -0.063317, -0.065586, -0.0655…
$ x992_nm  <dbl> -0.063971, -0.054763, -0.061604, -0.064064, -0.0641…
$ x994_nm  <dbl> -0.062210, -0.054189, -0.059695, -0.062089, -0.0624…
$ x996_nm  <dbl> -0.060678, -0.054361, -0.058403, -0.060639, -0.0612…
$ x998_nm  <dbl> -0.059275, -0.054014, -0.056868, -0.059018, -0.0598…
$ x1000_nm <dbl> -0.059126, -0.054822, -0.056419, -0.058185, -0.0595…
$ x1002_nm <dbl> -0.058903, -0.055303, -0.056021, -0.057415, -0.0592…
$ x1004_nm <dbl> -0.058488, -0.055101, -0.055192, -0.056311, -0.0583…
$ x1006_nm <dbl> -0.057291, -0.054473, -0.053844, -0.054969, -0.0570…
$ x1008_nm <dbl> -0.055408, -0.053119, -0.051961, -0.052664, -0.0552…
$ x1010_nm <dbl> -0.053862, -0.051626, -0.050363, -0.050925, -0.0537…
$ x1012_nm <dbl> -0.051421, -0.048967, -0.047896, -0.048186, -0.0516…
$ x1014_nm <dbl> -0.050485, -0.047722, -0.046456, -0.046844, -0.0508…
$ x1016_nm <dbl> -0.048349, -0.045717, -0.044870, -0.045207, -0.0492…
$ x1018_nm <dbl> -0.047586, -0.044365, -0.044068, -0.044741, -0.0484…
$ x1020_nm <dbl> -0.046898, -0.043336, -0.043379, -0.044397, -0.0481…
$ x1022_nm <dbl> -0.047726, -0.043581, -0.044227, -0.044697, -0.0487…
$ x1024_nm <dbl> -0.048686, -0.044036, -0.045469, -0.045614, -0.0498…
$ x1026_nm <dbl> -0.049626, -0.044856, -0.046493, -0.046452, -0.0506…
$ x1028_nm <dbl> -0.050714, -0.046041, -0.048213, -0.047669, -0.0520…
$ x1030_nm <dbl> -0.051541, -0.046858, -0.049097, -0.048569, -0.0528…
$ x1032_nm <dbl> -0.052976, -0.048652, -0.050967, -0.050055, -0.0546…
$ x1034_nm <dbl> -0.054019, -0.049853, -0.052127, -0.051178, -0.0555…
$ x1036_nm <dbl> -0.055507, -0.051562, -0.053886, -0.052781, -0.0569…
$ x1038_nm <dbl> -0.056386, -0.052526, -0.054432, -0.053594, -0.0576…
$ x1040_nm <dbl> -0.056779, -0.053738, -0.055172, -0.054460, -0.0580…
$ x1042_nm <dbl> -0.056876, -0.054069, -0.055257, -0.054997, -0.0582…
$ x1044_nm <dbl> -0.056439, -0.053925, -0.054866, -0.055201, -0.0578…
$ x1046_nm <dbl> -0.056708, -0.054880, -0.055286, -0.055934, -0.0585…
$ x1048_nm <dbl> -0.056432, -0.054857, -0.055256, -0.055994, -0.0583…
$ x1050_nm <dbl> -0.056974, -0.055695, -0.055922, -0.056841, -0.0590…
$ x1052_nm <dbl> -0.057091, -0.055656, -0.055649, -0.056687, -0.0586…
$ x1054_nm <dbl> -0.057541, -0.056352, -0.056044, -0.057192, -0.0590…
$ x1056_nm <dbl> -0.057746, -0.056501, -0.056164, -0.057521, -0.0589…
$ x1058_nm <dbl> -0.058387, -0.057275, -0.056653, -0.058097, -0.0592…
$ x1060_nm <dbl> -0.058997, -0.057927, -0.057349, -0.058708, -0.0597…
$ x1062_nm <dbl> -0.058975, -0.058044, -0.057620, -0.058940, -0.0601…
$ x1064_nm <dbl> -0.059624, -0.058311, -0.058109, -0.059339, -0.0607…
$ x1066_nm <dbl> -0.059737, -0.058253, -0.057978, -0.059002, -0.0607…
$ x1068_nm <dbl> -0.060552, -0.058841, -0.058726, -0.059353, -0.0617…
$ x1070_nm <dbl> -0.060416, -0.058138, -0.058442, -0.058568, -0.0615…
$ x1072_nm <dbl> -0.061099, -0.058603, -0.058857, -0.058309, -0.0623…
$ x1074_nm <dbl> -0.060784, -0.058309, -0.058779, -0.057284, -0.0621…
$ x1076_nm <dbl> -0.061292, -0.058444, -0.059351, -0.057273, -0.0625…
$ x1078_nm <dbl> -0.061811, -0.058993, -0.060286, -0.057433, -0.0631…
$ x1080_nm <dbl> -0.061852, -0.059121, -0.060681, -0.057911, -0.0634…
$ x1082_nm <dbl> -0.062380, -0.059445, -0.061533, -0.059303, -0.0640…
$ x1084_nm <dbl> -0.062816, -0.059475, -0.061879, -0.060453, -0.0642…
$ x1086_nm <dbl> -0.063620, -0.060274, -0.062840, -0.062042, -0.0651…
$ x1088_nm <dbl> -0.063730, -0.060256, -0.063150, -0.062812, -0.0654…
$ x1090_nm <dbl> -0.064291, -0.060515, -0.063667, -0.063620, -0.0658…
$ x1092_nm <dbl> -0.064209, -0.060839, -0.063744, -0.063369, -0.0657…
$ x1094_nm <dbl> -0.064162, -0.060784, -0.063969, -0.063106, -0.0659…
$ x1096_nm <dbl> -0.063681, -0.060401, -0.063588, -0.062210, -0.0654…
$ x1098_nm <dbl> -0.062757, -0.059631, -0.062893, -0.061071, -0.0646…
$ x1100_nm <dbl> -0.061833, -0.058887, -0.062346, -0.060476, -0.0637…
$ x1102_nm <dbl> -0.060653, -0.057861, -0.060954, -0.059613, -0.0623…
$ x1104_nm <dbl> -0.059872, -0.057174, -0.060277, -0.059053, -0.0614…
$ x1106_nm <dbl> -0.058221, -0.055943, -0.058933, -0.057879, -0.0596…
$ x1108_nm <dbl> -0.057000, -0.054586, -0.057725, -0.056993, -0.0583…
$ x1110_nm <dbl> -0.054962, -0.052730, -0.056403, -0.055674, -0.0566…
$ x1112_nm <dbl> -0.052726, -0.050650, -0.054639, -0.053633, -0.0544…
$ x1114_nm <dbl> -0.050188, -0.047988, -0.052604, -0.051368, -0.0521…
$ x1116_nm <dbl> -0.047313, -0.045184, -0.050002, -0.048753, -0.0490…
$ x1118_nm <dbl> -0.044383, -0.042775, -0.047952, -0.046565, -0.0458…
$ x1120_nm <dbl> -0.040658, -0.039567, -0.044838, -0.043878, -0.0415…
$ x1122_nm <dbl> -0.036981, -0.036483, -0.041595, -0.041106, -0.0366…
$ x1124_nm <dbl> -0.030389, -0.031762, -0.036552, -0.036145, -0.0297…
$ x1126_nm <dbl> -0.022454, -0.026061, -0.030222, -0.029643, -0.0205…
$ x1128_nm <dbl> -0.011760, -0.017961, -0.021030, -0.020369, -0.0080…
$ x1130_nm <dbl> 0.000941, -0.008352, -0.010311, -0.009442, 0.006771…
$ x1132_nm <dbl> 0.018067, 0.004104, 0.003633, 0.005123, 0.026072, 0…
$ x1134_nm <dbl> 0.039563, 0.020129, 0.021227, 0.023279, 0.050595, 0…
$ x1136_nm <dbl> 0.065490, 0.040174, 0.043017, 0.045674, 0.080282, 0…
$ x1138_nm <dbl> 0.093330, 0.061190, 0.065780, 0.069009, 0.111046, 0…
$ x1140_nm <dbl> 0.119886, 0.082808, 0.088740, 0.092430, 0.140255, 0…
$ x1142_nm <dbl> 0.143335, 0.103309, 0.110123, 0.113654, 0.162922, 0…
$ x1144_nm <dbl> 0.161862, 0.121239, 0.128801, 0.131457, 0.178850, 0…
$ x1146_nm <dbl> 0.178692, 0.139799, 0.147827, 0.148927, 0.192453, 0…
$ x1148_nm <dbl> 0.190440, 0.154684, 0.163220, 0.162499, 0.200921, 0…
$ x1150_nm <dbl> 0.194380, 0.163895, 0.171823, 0.169830, 0.200736, 0…
$ x1152_nm <dbl> 0.187846, 0.163702, 0.170286, 0.168042, 0.189637, 0…
$ x1154_nm <dbl> 0.173436, 0.155861, 0.160042, 0.159366, 0.170796, 0…
$ x1156_nm <dbl> 0.158487, 0.146830, 0.147558, 0.149124, 0.152340, 0…
$ x1158_nm <dbl> 0.146299, 0.139954, 0.137904, 0.141988, 0.138688, 0…
$ x1160_nm <dbl> 0.143034, 0.140520, 0.136222, 0.142149, 0.133949, 0…
$ x1162_nm <dbl> 0.146884, 0.147677, 0.142161, 0.148574, 0.137171, 0…
$ x1164_nm <dbl> 0.156093, 0.159739, 0.153683, 0.159490, 0.146438, 0…
$ x1166_nm <dbl> 0.167857, 0.174511, 0.168846, 0.173304, 0.159196, 0…
$ x1168_nm <dbl> 0.179468, 0.189393, 0.184297, 0.186928, 0.172705, 0…
$ x1170_nm <dbl> 0.192422, 0.204829, 0.201057, 0.201306, 0.187194, 0…
$ x1172_nm <dbl> 0.205774, 0.222001, 0.219339, 0.217320, 0.202939, 0…
$ x1174_nm <dbl> 0.222009, 0.240784, 0.238474, 0.235025, 0.221091, 0…
$ x1176_nm <dbl> 0.238851, 0.259705, 0.257398, 0.253360, 0.239527, 0…
$ x1178_nm <dbl> 0.257624, 0.279918, 0.277868, 0.273140, 0.260289, 0…
$ x1180_nm <dbl> 0.282144, 0.305844, 0.303828, 0.298527, 0.286929, 0…
$ x1182_nm <dbl> 0.316637, 0.342649, 0.340386, 0.334143, 0.324194, 0…
$ x1184_nm <dbl> 0.360689, 0.390617, 0.388889, 0.379790, 0.370256, 0…
$ x1186_nm <dbl> 0.410005, 0.445219, 0.444171, 0.431122, 0.418339, 0…
$ x1188_nm <dbl> 0.454409, 0.497316, 0.497294, 0.478623, 0.458811, 0…
$ x1190_nm <dbl> 0.485965, 0.535230, 0.535431, 0.512772, 0.480722, 0…
$ x1192_nm <dbl> 0.497682, 0.552865, 0.552290, 0.529281, 0.483581, 0…
$ x1194_nm <dbl> 0.489535, 0.545858, 0.541118, 0.523347, 0.465097, 0…
$ x1196_nm <dbl> 0.466098, 0.516133, 0.504049, 0.497873, 0.432696, 0…
$ x1198_nm <dbl> 0.432583, 0.474097, 0.453667, 0.462310, 0.392652, 0…
$ x1200_nm <dbl> 0.394805, 0.429533, 0.400454, 0.423781, 0.351462, 0…
$ x1202_nm <dbl> 0.359732, 0.388781, 0.353163, 0.387603, 0.315084, 0…
$ x1204_nm <dbl> 0.331966, 0.357836, 0.318767, 0.359001, 0.288042, 0…
$ x1206_nm <dbl> 0.309186, 0.334257, 0.293785, 0.336412, 0.267363, 0…
$ x1208_nm <dbl> 0.292567, 0.316834, 0.276645, 0.319126, 0.251579, 0…
$ x1210_nm <dbl> 0.275393, 0.300208, 0.260872, 0.301999, 0.235251, 0…
$ x1212_nm <dbl> 0.255452, 0.280041, 0.242538, 0.282530, 0.216596, 0…
$ x1214_nm <dbl> 0.235247, 0.259029, 0.224122, 0.261341, 0.198460, 0…
$ x1216_nm <dbl> 0.213967, 0.237029, 0.206128, 0.239522, 0.181406, 0…
$ x1218_nm <dbl> 0.192011, 0.215364, 0.188049, 0.218119, 0.165114, 0…
$ x1220_nm <dbl> 0.170424, 0.192902, 0.169174, 0.195463, 0.148187, 0…
$ x1222_nm <dbl> 0.150943, 0.171982, 0.152376, 0.174744, 0.133288, 0…
$ x1224_nm <dbl> 0.132428, 0.151256, 0.136216, 0.154083, 0.118446, 0…
$ x1226_nm <dbl> 0.115921, 0.133448, 0.121756, 0.135921, 0.105215, 0…
$ x1228_nm <dbl> 0.099535, 0.115749, 0.106574, 0.118325, 0.090738, 0…
$ x1230_nm <dbl> 0.084481, 0.099265, 0.090796, 0.101493, 0.076775, 0…
$ x1232_nm <dbl> 0.071254, 0.084244, 0.076286, 0.086168, 0.063483, 0…
$ x1234_nm <dbl> 0.058049, 0.070886, 0.062592, 0.072287, 0.050650, 0…
$ x1236_nm <dbl> 0.047446, 0.060095, 0.051793, 0.061218, 0.040134, 0…
$ x1238_nm <dbl> 0.036880, 0.049378, 0.041423, 0.050044, 0.030024, 0…
$ x1240_nm <dbl> 0.029082, 0.040630, 0.033518, 0.040826, 0.021975, 0…
$ x1242_nm <dbl> 0.021789, 0.032988, 0.026652, 0.032597, 0.015352, 0…
$ x1244_nm <dbl> 0.016038, 0.026985, 0.021066, 0.026030, 0.010021, 0…
$ x1246_nm <dbl> 0.011251, 0.021400, 0.016164, 0.020097, 0.005572, 0…
$ x1248_nm <dbl> 0.006186, 0.016041, 0.011051, 0.014526, 0.001400, 0…
$ x1250_nm <dbl> 0.003116, 0.012094, 0.007728, 0.010441, -0.001737, …
$ x1252_nm <dbl> -0.000998, 0.007511, 0.003719, 0.005578, -0.005442,…
$ x1254_nm <dbl> -0.004703, 0.003694, -0.000068, 0.001684, -0.008756…
$ x1256_nm <dbl> -0.009270, -0.001214, -0.004753, -0.002925, -0.0128…
$ x1258_nm <dbl> -0.012907, -0.005412, -0.008519, -0.007083, -0.0162…
$ x1260_nm <dbl> -0.017234, -0.010057, -0.012552, -0.011555, -0.0201…
$ x1262_nm <dbl> -0.020832, -0.014084, -0.016265, -0.015370, -0.0235…
$ x1264_nm <dbl> -0.024051, -0.017764, -0.019766, -0.018619, -0.0270…
$ x1266_nm <dbl> -0.027297, -0.021488, -0.023338, -0.021719, -0.0301…
$ x1268_nm <dbl> -0.029792, -0.024172, -0.025883, -0.023654, -0.0323…
$ x1270_nm <dbl> -0.032400, -0.027224, -0.029186, -0.025991, -0.0353…
$ x1272_nm <dbl> -0.033746, -0.029156, -0.031037, -0.027169, -0.0369…
$ x1274_nm <dbl> -0.035721, -0.031021, -0.033466, -0.028784, -0.0383…
$ x1276_nm <dbl> -0.036701, -0.032030, -0.034469, -0.029554, -0.0394…
$ x1278_nm <dbl> -0.037363, -0.032565, -0.035675, -0.030701, -0.0404…
$ x1280_nm <dbl> -0.037377, -0.032049, -0.035766, -0.031122, -0.0405…
$ x1282_nm <dbl> -0.037489, -0.031394, -0.036086, -0.031775, -0.0405…
$ x1284_nm <dbl> -0.037695, -0.031097, -0.036513, -0.032641, -0.0406…
$ x1286_nm <dbl> -0.036906, -0.029836, -0.036003, -0.032724, -0.0402…
$ x1288_nm <dbl> -0.037094, -0.029678, -0.036171, -0.033748, -0.0403…
$ x1290_nm <dbl> -0.036752, -0.028968, -0.035911, -0.034132, -0.0398…
$ x1292_nm <dbl> -0.037211, -0.029269, -0.036362, -0.035013, -0.0398…
$ x1294_nm <dbl> -0.037308, -0.029262, -0.036318, -0.035620, -0.0397…
$ x1296_nm <dbl> -0.037673, -0.029877, -0.036997, -0.036407, -0.0400…
$ x1298_nm <dbl> -0.037877, -0.030030, -0.037092, -0.036595, -0.0400…
$ x1300_nm <dbl> -0.038212, -0.030565, -0.037218, -0.036742, -0.0400…
$ x1302_nm <dbl> -0.038426, -0.031404, -0.037411, -0.037076, -0.0401…
$ x1304_nm <dbl> -0.038534, -0.032320, -0.037429, -0.037230, -0.0403…
$ x1306_nm <dbl> -0.039103, -0.033693, -0.038010, -0.037877, -0.0411…
$ x1308_nm <dbl> -0.038933, -0.034155, -0.037523, -0.037530, -0.0406…
$ x1310_nm <dbl> -0.039611, -0.035496, -0.038127, -0.038428, -0.0415…
$ x1312_nm <dbl> -0.039168, -0.035629, -0.037544, -0.037948, -0.0407…
$ x1314_nm <dbl> -0.038598, -0.035932, -0.037255, -0.037560, -0.0402…
$ x1316_nm <dbl> -0.037733, -0.035289, -0.036148, -0.036566, -0.0390…
$ x1318_nm <dbl> -0.037027, -0.034887, -0.035291, -0.035762, -0.0379…
$ x1320_nm <dbl> -0.035891, -0.034467, -0.034674, -0.035156, -0.0374…
$ x1322_nm <dbl> -0.034290, -0.033427, -0.033358, -0.033697, -0.0361…
$ x1324_nm <dbl> -0.033224, -0.032634, -0.032758, -0.032653, -0.0353…
$ x1326_nm <dbl> -0.031475, -0.030815, -0.031228, -0.030570, -0.0334…
$ x1328_nm <dbl> -0.030071, -0.029493, -0.030097, -0.028914, -0.0324…
$ x1330_nm <dbl> -0.028086, -0.027298, -0.027881, -0.026203, -0.0302…
$ x1332_nm <dbl> -0.026779, -0.026081, -0.026697, -0.024685, -0.0290…
$ x1334_nm <dbl> -0.024450, -0.024199, -0.024809, -0.022470, -0.0267…
$ x1336_nm <dbl> -0.022709, -0.022546, -0.022950, -0.020930, -0.0245…
$ x1338_nm <dbl> -0.020088, -0.020154, -0.020248, -0.018910, -0.0215…
$ x1340_nm <dbl> -0.016814, -0.017032, -0.016845, -0.016121, -0.0179…
$ x1342_nm <dbl> -0.013638, -0.013717, -0.013423, -0.013279, -0.0139…
$ x1344_nm <dbl> -0.008755, -0.008264, -0.007559, -0.008334, -0.0080…
$ x1346_nm <dbl> -0.003402, -0.002064, -0.000991, -0.003372, -0.0018…
$ x1348_nm <dbl> 0.004585, 0.007304, 0.008786, 0.004753, 0.007269, 0…
$ x1350_nm <dbl> 0.015708, 0.020121, 0.022739, 0.016161, 0.019779, 0…
$ x1352_nm <dbl> 0.032342, 0.039083, 0.043209, 0.033429, 0.038201, 0…
$ x1354_nm <dbl> 0.055075, 0.064193, 0.069797, 0.056411, 0.062477, 0…
$ x1356_nm <dbl> 0.083261, 0.095086, 0.102595, 0.085082, 0.092206, 0…
$ x1358_nm <dbl> 0.114275, 0.128027, 0.137954, 0.116397, 0.123556, 0…
$ x1360_nm <dbl> 0.142234, 0.158800, 0.169100, 0.145471, 0.151961, 0…
$ x1362_nm <dbl> 0.165117, 0.183265, 0.194530, 0.168918, 0.174343, 0…
$ x1364_nm <dbl> 0.181014, 0.201249, 0.212629, 0.185935, 0.189944, 0…
$ x1366_nm <dbl> 0.194140, 0.216701, 0.227677, 0.200819, 0.202222, 0…
$ x1368_nm <dbl> 0.207728, 0.232922, 0.243204, 0.215868, 0.215936, 0…
$ x1370_nm <dbl> 0.225084, 0.252380, 0.262941, 0.234213, 0.233155, 0…
$ x1372_nm <dbl> 0.245205, 0.274944, 0.285764, 0.255771, 0.254445, 0…
$ x1374_nm <dbl> 0.267331, 0.298853, 0.309733, 0.279190, 0.277749, 0…
$ x1376_nm <dbl> 0.290558, 0.322715, 0.333065, 0.303732, 0.301495, 0…
$ x1378_nm <dbl> 0.311237, 0.342958, 0.352530, 0.325323, 0.320695, 0…
$ x1380_nm <dbl> 0.328961, 0.360426, 0.369372, 0.344871, 0.336187, 0…
$ x1382_nm <dbl> 0.341901, 0.372804, 0.380216, 0.359555, 0.345160, 0…
$ x1384_nm <dbl> 0.350544, 0.381960, 0.387849, 0.370893, 0.350677, 0…
$ x1386_nm <dbl> 0.357751, 0.389191, 0.394677, 0.378461, 0.355522, 0…
$ x1388_nm <dbl> 0.364172, 0.397112, 0.402078, 0.386247, 0.362320, 0…
$ x1390_nm <dbl> 0.370471, 0.403062, 0.408582, 0.392268, 0.368313, 0…
$ x1392_nm <dbl> 0.373154, 0.406006, 0.410932, 0.393815, 0.370052, 0…
$ x1394_nm <dbl> 0.368176, 0.401295, 0.406170, 0.388471, 0.365085, 0…
$ x1396_nm <dbl> 0.356007, 0.389064, 0.393839, 0.376881, 0.352979, 0…
$ x1398_nm <dbl> 0.341120, 0.373591, 0.376215, 0.360814, 0.337215, 0…
$ x1400_nm <dbl> 0.325587, 0.356204, 0.357096, 0.344653, 0.321339, 0…
$ x1402_nm <dbl> 0.313571, 0.342050, 0.340687, 0.331744, 0.308459, 0…
$ x1404_nm <dbl> 0.306847, 0.332780, 0.329190, 0.323812, 0.299508, 0…
$ x1406_nm <dbl> 0.305310, 0.329005, 0.322999, 0.321500, 0.295653, 0…
$ x1408_nm <dbl> 0.304471, 0.328951, 0.321164, 0.323119, 0.295204, 0…
$ x1410_nm <dbl> 0.304125, 0.328051, 0.319110, 0.322928, 0.292860, 0…
$ x1412_nm <dbl> 0.298698, 0.322452, 0.311924, 0.317972, 0.286862, 0…
$ x1414_nm <dbl> 0.285589, 0.308979, 0.298671, 0.305534, 0.274217, 0…
$ x1416_nm <dbl> 0.267933, 0.290704, 0.280373, 0.288311, 0.257335, 0…
$ x1418_nm <dbl> 0.247573, 0.269241, 0.258819, 0.267978, 0.237130, 0…
$ x1420_nm <dbl> 0.229774, 0.249374, 0.238800, 0.248524, 0.219594, 0…
$ x1422_nm <dbl> 0.215690, 0.233860, 0.222677, 0.232934, 0.205760, 0…
$ x1424_nm <dbl> 0.207192, 0.224821, 0.213798, 0.223277, 0.197859, 0…
$ x1426_nm <dbl> 0.202051, 0.220504, 0.209824, 0.217777, 0.195243, 0…
$ x1428_nm <dbl> 0.201819, 0.220685, 0.210649, 0.216610, 0.196015, 0…
$ x1430_nm <dbl> 0.202353, 0.222163, 0.212908, 0.216723, 0.197585, 0…
$ x1432_nm <dbl> 0.201612, 0.222347, 0.214331, 0.216083, 0.198133, 0…
$ x1434_nm <dbl> 0.199291, 0.221434, 0.214048, 0.214255, 0.196885, 0…
$ x1436_nm <dbl> 0.193770, 0.215752, 0.209266, 0.208305, 0.190623, 0…
$ x1438_nm <dbl> 0.186669, 0.209041, 0.202558, 0.201097, 0.184152, 0…
$ x1440_nm <dbl> 0.178426, 0.200297, 0.193675, 0.192070, 0.175524, 0…
$ x1442_nm <dbl> 0.171231, 0.192172, 0.185359, 0.183795, 0.167989, 0…
$ x1444_nm <dbl> 0.164144, 0.183974, 0.176963, 0.176042, 0.159801, 0…
$ x1446_nm <dbl> 0.157259, 0.175816, 0.169181, 0.168643, 0.152625, 0…
$ x1448_nm <dbl> 0.149194, 0.167254, 0.159964, 0.160181, 0.144600, 0…
$ x1450_nm <dbl> 0.139789, 0.156358, 0.149142, 0.150664, 0.134160, 0…
$ x1452_nm <dbl> 0.130173, 0.146486, 0.139082, 0.141494, 0.125008, 0…
$ x1454_nm <dbl> 0.120745, 0.137275, 0.129739, 0.133234, 0.115873, 0…
$ x1456_nm <dbl> 0.113733, 0.130180, 0.122022, 0.126515, 0.109185, 0…
$ x1458_nm <dbl> 0.107519, 0.123365, 0.116297, 0.120700, 0.103171, 0…
$ x1460_nm <dbl> 0.103765, 0.119438, 0.113173, 0.117645, 0.100027, 0…
$ x1462_nm <dbl> 0.101175, 0.116286, 0.110954, 0.115052, 0.098121, 0…
$ x1464_nm <dbl> 0.100106, 0.114482, 0.110095, 0.113629, 0.097745, 0…
$ x1466_nm <dbl> 0.099068, 0.112444, 0.109058, 0.112397, 0.097652, 0…
$ x1468_nm <dbl> 0.097315, 0.108714, 0.106665, 0.108961, 0.096193, 0…
$ x1470_nm <dbl> 0.095724, 0.106003, 0.104405, 0.106436, 0.095738, 0…
$ x1472_nm <dbl> 0.093093, 0.102389, 0.100941, 0.102798, 0.093557, 0…
$ x1474_nm <dbl> 0.090002, 0.098235, 0.097474, 0.098874, 0.090709, 0…
$ x1476_nm <dbl> 0.086612, 0.093959, 0.093497, 0.094944, 0.087039, 0…
$ x1478_nm <dbl> 0.084922, 0.091774, 0.091673, 0.092621, 0.085000, 0…
$ x1480_nm <dbl> 0.080780, 0.087636, 0.087493, 0.088422, 0.080375, 0…
$ x1482_nm <dbl> 0.077849, 0.084646, 0.084593, 0.084868, 0.077113, 0…
$ x1484_nm <dbl> 0.074536, 0.082097, 0.081474, 0.081632, 0.073986, 0…
$ x1486_nm <dbl> 0.070910, 0.079089, 0.078340, 0.078056, 0.069980, 0…
$ x1488_nm <dbl> 0.068700, 0.077735, 0.076682, 0.076356, 0.067515, 0…
$ x1490_nm <dbl> 0.064584, 0.074444, 0.073916, 0.072892, 0.064092, 0…
$ x1492_nm <dbl> 0.061726, 0.071813, 0.071351, 0.069921, 0.060768, 0…
$ x1494_nm <dbl> 0.057395, 0.068586, 0.068287, 0.066060, 0.057105, 0…
$ x1496_nm <dbl> 0.053704, 0.065580, 0.065568, 0.062555, 0.054163, 0…
$ x1498_nm <dbl> 0.049329, 0.061501, 0.061216, 0.057895, 0.050508, 0…
$ x1500_nm <dbl> 0.044695, 0.056740, 0.055821, 0.052995, 0.045830, 0…
$ x1502_nm <dbl> 0.039937, 0.051068, 0.049529, 0.047908, 0.040213, 0…
$ x1504_nm <dbl> 0.034641, 0.045104, 0.043150, 0.042637, 0.034258, 0…
$ x1506_nm <dbl> 0.030164, 0.040192, 0.037184, 0.038187, 0.029202, 0…
$ x1508_nm <dbl> 0.024898, 0.034145, 0.030828, 0.032880, 0.023555, 0…
$ x1510_nm <dbl> 0.021588, 0.029970, 0.026308, 0.029345, 0.019996, 0…
$ x1512_nm <dbl> 0.017102, 0.024131, 0.020805, 0.024026, 0.015186, 0…
$ x1514_nm <dbl> 0.014860, 0.020670, 0.016907, 0.020904, 0.012201, 0…
$ x1516_nm <dbl> 0.010854, 0.016931, 0.013194, 0.017431, 0.008836, 0…
$ x1518_nm <dbl> 0.008205, 0.013083, 0.009759, 0.014136, 0.005561, 0…
$ x1520_nm <dbl> 0.005869, 0.010199, 0.007271, 0.011749, 0.003463, 0…
$ x1522_nm <dbl> 0.003920, 0.007926, 0.004796, 0.009634, 0.001051, 0…
$ x1524_nm <dbl> 0.002107, 0.005733, 0.002720, 0.007699, -0.001164, …
$ x1526_nm <dbl> -0.000203, 0.002833, 0.000243, 0.005545, -0.003490,…
$ x1528_nm <dbl> -0.001072, 0.001596, -0.001196, 0.004693, -0.004459…
$ x1530_nm <dbl> -0.003325, -0.000928, -0.003965, 0.002553, -0.00740…
$ x1532_nm <dbl> -0.004707, -0.002545, -0.005704, 0.001494, -0.00901…
$ x1534_nm <dbl> -0.006635, -0.003656, -0.007156, -0.000102, -0.0101…
$ x1536_nm <dbl> -0.008403, -0.006217, -0.009860, -0.003056, -0.0127…
$ x1538_nm <dbl> -0.009317, -0.006885, -0.010959, -0.004391, -0.0138…
$ x1540_nm <dbl> -0.011351, -0.008307, -0.012194, -0.006868, -0.0151…
$ x1542_nm <dbl> -0.012280, -0.009142, -0.013605, -0.008665, -0.0158…
$ x1544_nm <dbl> -0.013474, -0.010409, -0.014564, -0.010355, -0.0171…
$ x1546_nm <dbl> -0.014453, -0.010983, -0.015313, -0.011261, -0.0175…
$ x1548_nm <dbl> -0.015262, -0.012033, -0.015723, -0.011896, -0.0181…
$ x1550_nm <dbl> -0.015588, -0.012020, -0.015263, -0.011357, -0.0179…
$ x1552_nm <dbl> -0.016349, -0.013442, -0.016537, -0.012562, -0.0191…
$ x1554_nm <dbl> -0.016028, -0.012709, -0.015581, -0.011408, -0.0183…
$ x1556_nm <dbl> -0.016209, -0.011563, -0.014147, -0.010164, -0.0173…
$ x1558_nm <dbl> -0.015985, -0.010432, -0.013280, -0.009194, -0.0168…
$ x1560_nm <dbl> -0.015483, -0.010091, -0.013291, -0.009642, -0.0167…
$ x1562_nm <dbl> -0.014216, -0.007960, -0.012268, -0.008700, -0.0155…
$ x1564_nm <dbl> -0.013790, -0.006697, -0.011176, -0.007899, -0.0145…
$ x1566_nm <dbl> -0.013844, -0.005367, -0.010897, -0.006986, -0.0146…
$ x1568_nm <dbl> -0.011773, -0.003432, -0.009383, -0.005475, -0.0131…
$ x1570_nm <dbl> -0.010432, -0.002665, -0.008983, -0.004838, -0.0130…
$ x1572_nm <dbl> -0.009606, -0.001307, -0.007542, -0.003785, -0.0112…
$ x1574_nm <dbl> -0.008209, -0.000406, -0.005716, -0.002646, -0.0101…
$ x1576_nm <dbl> -0.006732, 0.001300, -0.003888, -0.001373, -0.00784…
$ x1578_nm <dbl> -0.004573, 0.003107, -0.001115, 0.000559, -0.005406…
$ x1580_nm <dbl> -0.003433, 0.003075, 0.000176, 0.001422, -0.004259,…
$ x1582_nm <dbl> -0.001200, 0.004513, 0.002464, 0.003357, -0.002039,…
$ x1584_nm <dbl> -0.000066, 0.005336, 0.003951, 0.005481, -0.000655,…
$ x1586_nm <dbl> 0.002759, 0.008496, 0.008541, 0.010048, 0.003930, 0…
$ x1588_nm <dbl> 0.004036, 0.009262, 0.009676, 0.011528, 0.004579, 0…
$ x1590_nm <dbl> 0.006620, 0.010982, 0.012153, 0.014357, 0.007207, 0…
$ x1592_nm <dbl> 0.008709, 0.012330, 0.013639, 0.015642, 0.008612, 0…
$ x1594_nm <dbl> 0.011810, 0.015301, 0.016898, 0.018304, 0.012129, 0…
$ x1596_nm <dbl> 0.014426, 0.018270, 0.019847, 0.020612, 0.015541, 0…
$ x1598_nm <dbl> 0.017254, 0.020912, 0.022470, 0.022399, 0.018535, 0…
$ x1600_nm <dbl> 0.020633, 0.024617, 0.026069, 0.025038, 0.022522, 0…
$ x1602_nm <dbl> 0.025032, 0.028188, 0.029714, 0.027860, 0.026739, 0…
$ x1604_nm <dbl> 0.029496, 0.033321, 0.034281, 0.032481, 0.032223, 0…
$ x1606_nm <dbl> 0.034242, 0.037755, 0.039133, 0.036772, 0.037103, 0…
$ x1608_nm <dbl> 0.039735, 0.043515, 0.045228, 0.042282, 0.043411, 0…
$ x1610_nm <dbl> 0.046803, 0.050530, 0.051833, 0.049100, 0.051336, 0…
$ x1612_nm <dbl> 0.053743, 0.058033, 0.059036, 0.056030, 0.059517, 0…
$ x1614_nm <dbl> 0.062788, 0.067009, 0.068655, 0.065351, 0.069594, 0…
$ x1616_nm <dbl> 0.073531, 0.077781, 0.079319, 0.075811, 0.081145, 0…
$ x1618_nm <dbl> 0.087246, 0.091191, 0.092339, 0.089083, 0.096240, 0…
$ x1620_nm <dbl> 0.104000, 0.106815, 0.107952, 0.105323, 0.112862, 0…
$ x1622_nm <dbl> 0.124210, 0.123946, 0.123435, 0.123006, 0.131164, 0…
$ x1624_nm <dbl> 0.147924, 0.146013, 0.143631, 0.145753, 0.153401, 0…
$ x1626_nm <dbl> 0.175024, 0.171884, 0.166498, 0.173142, 0.178326, 0…
$ x1628_nm <dbl> 0.202711, 0.197074, 0.188171, 0.200956, 0.202094, 0…
$ x1630_nm <dbl> 0.232547, 0.224837, 0.210848, 0.231804, 0.228229, 0…
$ x1632_nm <dbl> 0.259828, 0.250088, 0.232041, 0.259687, 0.252492, 0…
$ x1634_nm <dbl> 0.282117, 0.270191, 0.249246, 0.281929, 0.273077, 0…
$ x1636_nm <dbl> 0.299937, 0.284011, 0.261166, 0.297769, 0.290969, 0…
$ x1638_nm <dbl> 0.313416, 0.290728, 0.271317, 0.308358, 0.307760, 0…
$ x1640_nm <dbl> 0.322870, 0.294182, 0.276957, 0.311430, 0.321736, 0…
$ x1642_nm <dbl> 0.327217, 0.293247, 0.279662, 0.311149, 0.331647, 0…
$ x1644_nm <dbl> 0.331907, 0.289916, 0.280366, 0.308663, 0.337756, 0…
$ x1646_nm <dbl> 0.336522, 0.291554, 0.284626, 0.308939, 0.344823, 0…
$ x1648_nm <dbl> 0.341847, 0.293929, 0.289691, 0.311825, 0.352189, 0…
$ x1650_nm <dbl> 0.350585, 0.301816, 0.297987, 0.319857, 0.362563, 0…
$ x1652_nm <dbl> 0.368032, 0.316670, 0.312909, 0.335700, 0.379202, 0…
$ x1654_nm <dbl> 0.393494, 0.336533, 0.333932, 0.355742, 0.404428, 0…
$ x1656_nm <dbl> 0.427370, 0.363209, 0.362284, 0.384754, 0.441534, 0…
$ x1658_nm <dbl> 0.472357, 0.399271, 0.400352, 0.424769, 0.494011, 0…
$ x1660_nm <dbl> 0.528721, 0.443964, 0.445686, 0.472003, 0.557655, 0…
$ x1662_nm <dbl> 0.593168, 0.499024, 0.504338, 0.531050, 0.635493, 0…
$ x1664_nm <dbl> 0.666647, 0.559683, 0.567039, 0.596606, 0.721107, 0…
$ x1666_nm <dbl> 0.743945, 0.624577, 0.635874, 0.662997, 0.803044, 0…
$ x1668_nm <dbl> 0.833972, 0.697450, 0.711690, 0.744126, 0.893778, 0…
$ x1670_nm <dbl> 0.909368, 0.770433, 0.792939, 0.817398, 0.968985, 0…
$ x1672_nm <dbl> 0.985332, 0.848769, 0.872021, 0.895422, 1.050312, 0…
$ x1674_nm <dbl> 1.051159, 0.916590, 0.938915, 0.959761, 1.104573, 1…
$ x1676_nm <dbl> 1.107395, 0.975048, 1.004690, 1.021595, 1.152964, 1…
$ x1678_nm <dbl> 1.158488, 1.036499, 1.067934, 1.082497, 1.192131, 1…
$ x1680_nm <dbl> 1.174795, 1.085067, 1.108813, 1.117447, 1.219254, 1…
$ x1682_nm <dbl> 1.198461, 1.128877, 1.147964, 1.160089, 1.252712, 1…
$ x1684_nm <dbl> 1.224243, 1.148342, 1.167798, 1.169350, 1.238013, 1…
$ x1686_nm <dbl> 1.242645, 1.189116, 1.198287, 1.201066, 1.259616, 1…
$ x1688_nm <dbl> 1.250789, 1.223242, 1.237383, 1.233299, 1.273713, 1…
$ x1690_nm <dbl> 1.246626, 1.253306, 1.260979, 1.262966, 1.296524, 1…
$ x1692_nm <dbl> 1.250985, 1.282889, 1.276677, 1.272709, 1.299507, 1…
$ x1694_nm <dbl> 1.264189, 1.215065, 1.218871, 1.211068, 1.226448, 1…
$ x1696_nm <dbl> 1.244678, 1.225211, 1.223132, 1.215044, 1.230718, 1…
$ x1698_nm <dbl> 1.245913, 1.227985, 1.230321, 1.232655, 1.232864, 1…
$ x1700_nm <dbl> 1.221135, 1.198851, 1.208742, 1.206696, 1.202926, 1…

EDA

Check for missing values using purrr::map()

gasoline_tidy %>% 
  purrr::map(is.na) %>% 
  map_df(sum) %>% 
  tidy() %>% 
  dplyr::select(column, mean) %>% 
  as_tibble() %>% 
  filter(mean>0) # no missing values
# A tibble: 0 x 2
# … with 2 variables: column <chr>, mean <dbl>

For other datasets, one should check the correlation again, but in this case, as this is a NIR dataset, I will not do this step.

Split dataset

# initial split
set.seed(20210308)
gasoline_split <- initial_split(gasoline_tidy, prop = 0.8)

gasoline_training <- gasoline_split %>% 
  training()

gasoline_testing <- gasoline_split %>% 
  testing()

gasoline_cv <- vfold_cv(gasoline_training) # to tune number of components later

Modelling

# recipe

gasoline_reciped <- recipe(octane ~ ., data = gasoline_training) %>% 
  update_role(octane, new_role = "outcome") %>% 
  step_normalize(all_predictors())

gasoline_reciped
Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor        401

Operations:

Centering and scaling for all_predictors()
# fit model

pls_model <- plsmod::pls(num_comp = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("mixOmics")

# put into workflow

pls_workflow <- workflow() %>% 
  add_recipe(gasoline_reciped) %>% 
  add_model(pls_model)

# create grid
pls_grid <- expand.grid(num_comp = seq (from = 1, to = 20, by = 1))

tuned_pls_results <- pls_workflow %>% 
  tune_grid(resamples = gasoline_cv,
            grid = pls_grid,
            metrics = metric_set(mae, rmse, rsq))

(model_results <- tuned_pls_results %>% 
  collect_metrics())
# A tibble: 60 x 7
   num_comp .metric .estimator  mean     n std_err .config            
      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>              
 1        1 mae     standard   1.15     10 0.122   Preprocessor1_Mode…
 2        1 rmse    standard   1.29     10 0.137   Preprocessor1_Mode…
 3        1 rsq     standard   0.371    10 0.104   Preprocessor1_Mode…
 4        2 mae     standard   0.573    10 0.0745  Preprocessor1_Mode…
 5        2 rmse    standard   0.691    10 0.0876  Preprocessor1_Mode…
 6        2 rsq     standard   0.716    10 0.0874  Preprocessor1_Mode…
 7        3 mae     standard   0.212    10 0.0280  Preprocessor1_Mode…
 8        3 rmse    standard   0.259    10 0.0289  Preprocessor1_Mode…
 9        3 rsq     standard   0.972    10 0.00967 Preprocessor1_Mode…
10        4 mae     standard   0.187    10 0.0177  Preprocessor1_Mode…
# … with 50 more rows

Visualize:

# plot

tuned_pls_results %>% 
  collect_metrics() %>% 
  ggplot(aes(num_comp, mean, col = .metric)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(n.breaks = 20) +
  labs(x = "Number of components",
       y = "Indicator",
       title = "Plot of MAE, RMSE and R-SQ vs number of components for TRAINING dataset, with 10-fold repeated cross validation",
       subtitle = "Optimal number of components is 3") +
 facet_grid(.metric ~.) +
  theme_few() +
  theme(legend.position = "none")

Check against numerical values:

# check against numerical values
tuned_best <- tuned_pls_results %>% 
  select_best("rsq") 

tuned_best
# A tibble: 1 x 2
  num_comp .config              
     <dbl> <chr>                
1        4 Preprocessor1_Model04
# check with other indicators
tuned_pls_results %>% 
  select_best("mae") 
# A tibble: 1 x 2
  num_comp .config              
     <dbl> <chr>                
1        4 Preprocessor1_Model04
tuned_pls_results %>% 
  select_best("rmse")
# A tibble: 1 x 2
  num_comp .config              
     <dbl> <chr>                
1        4 Preprocessor1_Model04

In this case, I will go for num_comp = 4. The aim of PLS is to reduce number of variables, but there would be a slight trade-off in terms of accuracy/error in favor of simplicity of model when determining the number of components.

Update model and workflow:

updated_pls_model <-  plsmod::pls(num_comp = 4) %>% 
  set_mode("regression") %>% 
  set_engine("mixOmics")

updated_workflow <- pls_workflow %>% 
  update_model(updated_pls_model)
  

pls_fit <- updated_workflow %>% 
  fit(data = gasoline_training)

Check the most important X variables for the updated model:

# check the most important predictors

tidy_pls <- pls_fit %>% 
  pull_workflow_fit() %>% 
  tidy()


# variable importance
tidy_pls %>% 
  filter(term != "Y", # outcome variable col name
         component == c(1:4)) %>% 
  group_by(component) %>% 
  slice_max(abs(value), n = 20) %>% 
  ungroup() %>% 
  ggplot(aes(value, fct_reorder(term, value), fill = factor(component))) +
  geom_col(show.legend = F) +
  facet_wrap( ~ component, scales = "free_y") +
  labs( y = NULL) +
  theme_few()

Assess

# results_train
pls_fit %>% 
  predict(new_data = gasoline_training) %>% 
  mutate(truth = gasoline_training$octane) %>% 
  ggplot(aes(truth, .pred)) +
  geom_point() +
  geom_abline() +
  labs(title = "Actual vs Predicted for TRAINING dataset",
       x = "Actual Octane Number",
       y = "Predicted Octane Number") +
  theme_few()
pls_fit %>% 
  predict(new_data = gasoline_training) %>% 
  mutate(truth = gasoline_training$octane) %>% 
  metrics(truth = truth, estimate = .pred) %>% 
  as_tibble() # rsq is 0.981 for training dataset
# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.207
2 rsq     standard       0.981
3 mae     standard       0.168
# results_test
pls_fit %>% 
  predict(new_data = gasoline_testing) %>% 
  mutate(truth = gasoline_testing$octane) %>% 
  ggplot(aes(truth, .pred)) +
  geom_point(col = "deepskyblue3") +
  geom_abline(col = "deepskyblue3") +
  labs(title = "Actual vs Predicted for TESTING dataset",
       x = "Actual Octane Number",
       y = "Predicted Octane Number") +
  theme_few()
pls_fit %>% 
  predict(new_data = gasoline_testing) %>% 
  mutate(truth = gasoline_testing$octane) %>% 
  metrics(truth = truth, estimate = .pred) %>% 
  as_tibble() # r-sq is 0.989 for testing dataset
# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.222
2 rsq     standard       0.983
3 mae     standard       0.151

To predict future data

In this case, I will just take the first dataset and check if the predicted value is the actual value. This should not be done, but I did it just to understand how the workflow is.

trial_data <- gasoline_tidy %>% 
  head((1)) %>% 
  dplyr::select(-octane) # octane = 85.3

pls_fit %>% 
  predict(trial_data) # 85.3
# A tibble: 1 x 1
  .pred
  <dbl>
1  85.2
# actual value = predicted value

Reflections

Through this exercise, I learnt:

References

Citation

For attribution, please cite this work as

lruolin (2021, March 9). pRactice corner: Tidy Models - Regression. Retrieved from https://lruolin.github.io/myBlog/posts/20210309_tidy models plsr gasoline/

BibTeX citation

@misc{lruolin2021tidy,
  author = {lruolin, },
  title = {pRactice corner: Tidy Models - Regression},
  url = {https://lruolin.github.io/myBlog/posts/20210309_tidy models plsr gasoline/},
  year = {2021}
}